On the Equivalence of Constructed-Response and
Multiple-Choice Tests 
Ross E. Traub 
Ontario Institute for Studies in Education 
Charles W. Fisher 
Far West Laboratory for Educational Research and Development 

Abstract 

Two sets of mathematical reasoning and two sets of verbal comprehension 
items were cast into each of three formats — constructed response, standard 
multiple-choice, and Coombs multiple-choice — in order to assess whether 
tests with identical content but different formats measure the same 
attribute, except for possible differences in error variance and scaling 
factors. The resulting 12 tests were administered to 199 eighth-grade 
students. The hypothesis of equivalent measures was rejected for only two 
comparisons: the constructed response measure of verbal comprehension was 
different from both the standard and the Coombs multiple-choice measures of 
this ability. Maximum likelihood factor analysis confirmed the hypothesis 
that a five factor structure will give a satisfactory account of the common 
variance among the 12 tests. As expected, the two major factors were 
mathematical reasoning and verbal comprehension. Contrary to expectation, 
only one of the other three factors bore a (weak) resemblance to a format
factor. Tests marking the ability to follow directions, recall and recogni- 
tion memory, and risk taking were included, but these variables did not 
correlate as expected with the three minor factors. 
A question of enduring interest for students of educational measurement
is whether tests that employ different response formats, but that in
other respects are as similar as possible, measure the same attribute. This 
question has been asked of constructed-response as compared with multiple- 
choice tests (Cook, 1955; Davis & Fifer, 1959; Heim & Watts, 1967; Vernon, 
1962) and of multiple-choice tests having standard as compared with non-standard 
formats (Dressel & Schmid, 1953; Coombs, Milholland & Womer, 1956; Rippey, 1968;
Hambleton, Roberts & Traub, 1970). Results of available research suggest 
that the distributions of scores on tests employing different formats cannot 
be assumed to have the same mean and standard deviation, even when the tests 
are administered to the same group of examinees or to groups that differ 
only because of random allocation of examinees to groups. In addition, the 
reliability and criterion correlation coefficients associated with 
different response formats cannot be assumed to be the same. These results 
are not, of course, sufficient evidence to reject the null hypothesis that 
tests with different formats measure the same attribute. It is possible to 
account for differences in means and standard deviations through appeal 
to possible differences in the scales of measurement associated with 
different formats; and differences in reliability and criterion correlation 
coefficients can be attributed to possible differences in the relative 
amount of error variance associated with the different test formats and 
also to possible differences in the scales of measurement such that one scale 
is a nonlinear transformation of the other. 
For several years now statistical procedures have been in existence
for testing the null hypothesis of equivalence of measures. Early exemplars 
of these procedures (Lord, 1957; McNemar, 1958) were somewhat difficult to 
use. More recently, however, Lord (1971) has presented a statistically
rigorous test based on work by Villegas (1964) that is relatively easy to
employ. This test is of the hypothesis that "two sets of measurements 
differ only because of a) errors of measurement, b) differing units of 
measurement, and c) differing arbitrary origins for measurement" (Lord, 
1971, p. 1). Clearly, this test accounts for all the previously described
reasons for differences between the measurements yielded by two different 
test formats except those differences caused by the fact that one scale is 
a nonlinear transformation of the other. 
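
The hypothesis quoted above can be stated compactly in the notation of classical test theory. The following is only a sketch: the symbols X, Y, T, E, a and b are introduced here for illustration and are not taken from Lord (1971).

```latex
\begin{align*}
X &= T + E_X, \\
Y &= a + b\,T + E_Y, \qquad b \neq 0,
\end{align*}
% T        : true score common to both measurements
% a        : difference in arbitrary origin
% b        : difference in unit of measurement
% E_X, E_Y : errors of measurement, uncorrelated with T and with each
%            other; their variances are allowed to differ
```

Under this statement the true-score components of the two measurements are perfectly linearly related. A relation of the form Y = g(T) + E_Y with g nonlinear falls outside the hypothesis, which is the exception noted in the text.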

In addition to Lord's procedure, recent developments in factor 
analysis make it possible to test hypotheses about the relationship among 
measurements arising from tests with different formats. These develop- 
ments, subsumed under the heading, confirmatory factor analysis, have been 
made principally by Jöreskog (1969, 1971) and McDonald (1969).

The purpose of the present investigation was to test the equiva- 
lence of three response formats, each applied to items from two different 
content domains. The formats were (i) constructed-response, (ii) standard
multiple-choice, in which the examinee is instructed to choose one option
per item, the one he thinks is correct, and (iii) non-standard multiple-
choice, in which the examinee is asked to identify as many of the incorrect
options as he can. This latter procedure was described by Coombs,
Milholland and Womer (1956) and is hereafter called the Coombs format or
the Coombs procedure. The two content domains were verbal comprehension, 
as defined operationally by questions on the meaning of words, and mathe- 
matical reasoning, as defined operationally by questions about a variety 
of mathematical concepts and skills, and by problems, the solution of which 
depends on the ability to apply a variety of mathematical concepts and 
skills. 

The motivation for studying the equivalence of measurements arising 
from different response formats was to gain some further understanding of 
partial knowledge. The standard multiple-choice format does not assess and 
credit partial knowledge, the kind of knowledge that enables an examinee to 
respond at a better-than-chance level to items that cannot with certainty 
be answered correctly. The Coombs format nullifies this criticism because 
it enables an examinee to gain partial credit by identifying one or more 
of the incorrect options to an item, even when not all of the incorrect 
options can be identified. What remains at issue, in the face of this 
logical analysis, is whether measurements based on the Coombs format 
reflect the same attribute as measurements based on the standard multiple- 
choice format. For example, it might be the case that the longer and more
involved instructions associated with the Coombs format introduce the
factor of following directions into the measurements, a factor that might
not be present in measurements based on the standard multiple-choice
format with its simpler instructions.
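
To make the idea of partial credit concrete, the sketch below applies one commonly described scoring rule for the Coombs procedure: one point for each incorrect option the examinee eliminates, and a penalty of k - 1 points (k being the number of options) if the correct option is eliminated. The rule, the function name and the five-option items are assumptions made for illustration; the exact rule used in the present study is not stated in this section.

```python
def coombs_item_score(eliminated, correct, n_options):
    """Score one item under an assumed Coombs-style rule.

    eliminated: set of option indices the examinee marked as incorrect
    correct:    index of the keyed (correct) option
    n_options:  number of options on the item
    """
    if correct in eliminated:
        # Eliminating the keyed answer costs n_options - 1 points.
        penalty = n_options - 1
    else:
        penalty = 0
    # Every distractor eliminated earns one point.
    distractors_eliminated = len(eliminated - {correct})
    return distractors_eliminated - penalty


# Full knowledge: all four distractors of a five-option item are eliminated.
full = coombs_item_score({1, 2, 3, 4}, correct=0, n_options=5)
# Partial knowledge: only two distractors can be ruled out.
partial = coombs_item_score({1, 2}, correct=0, n_options=5)
# Misinformation: the keyed answer is eliminated along with one distractor.
wrong = coombs_item_score({0, 3}, correct=0, n_options=5)
print(full, partial, wrong)
```

Under this assumed rule the examinee with partial knowledge earns an intermediate score, which is the property the standard multiple-choice format lacks.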
A comparison of the Coombs and standard multiple-choice formats
appears interesting in its own right, but both these formats can be viewed
as ways to simplify and objectify the scoring of constructed-response items.
To the extent that this view of objectively scorable tests is accepted, 
interest extends to a comparison of measurements derived from all three 
formats. Again, the issue is whether the measurements derived from a 
constructed-response format reflect the same attribute as measurements 
derived from objectively scorable formats. For example, items that are 
designed to test factual knowledge and that involve the constructed-response
format can be answered by the exercise of recall memory. The
same items when cast into a multiple-choice format can be answered by the 
exercise of either recall or recognition memory. In addition, multiple- 
choice formats are more clearly subject to the influence of risk taking 
(guessing) behavior than is a constructed-response format. In the case 
of the constructed-response format, an examinee can guess only if he makes 
the effort to generate a response. This fact alone operates against risk- 
taking behavior. In addition, the set of possible responses is probably 
quite large, although for any examinee it consists of only those
possibilities he can generate, and this number is not necessarily large; the
larger the set of possible responses, the less likely the examinee is to 
guess correctly and the less that risk-taking can influence his test score. 
On the other hand, in the case of multiple-choice formats, the set of possible
responses is small, in addition to being precisely the same for every
examinee. This means that the probability of a correct guess is suffi- 
ciently large for risk-taking to influence test scores significantly. 
Fortunately, the topic of risk-taking on multiple-choice tests has been the
subject of considerable research, and measures of individual differences in
risk-taking have been proposed; hence it is a factor that can be included 
as an independent variable in research studies. 

In summary, the main purpose of the present study was to test the 
equivalence of measurements obtained using constructed-response, standard
multiple-choice and Coombs response formats. Secondarily, the study was
designed to identify format factors and to study the association between
these factors, if found, and the psychological attributes of following 
directions ability, recall memory, recognition memory, and risk taking. 

Method 

Instrumentation

To attain the main purpose of this investigation, it was necessary 
to impose two constraints on the measures devised for each content domain: 
(1) The content of the measures for one test format had to be as similar 
as possible to the content of the measures for another test format. This 
constraint was satisfied by using the same set of item stems for all three 
test formats and the same item response options for both the standard
multiple-choice and Coombs formats. (2) The number of measures per response
format had to be at least two in order to implement Lord's (1971) procedure
for testing equivalence. This constraint was satisfied by forming two sets
of verbal comprehension items and two sets of mathematical reasoning items.
The two sets of verbal comprehension items were drawn from a pool formed by 
the items marking the verbal comprehension factor in the Kit of Reference 
Tests for Cognitive Factors (French, Ekstrom & Price, 1963); the two sets of 
mathematical reasoning items were drawn from a pool consisting of the 
mathematics items in Forms 3A and 4A of the 1957 edition of SCAT (ETS, 1957),
the items marking the general reasoning factor in the Kit of Reference Tests
for Cognitive Factors (French et al., 1963), and the items in the Canadian
New Achievement Test in Mathematics (OISE, 1965). The large pools of verbal 
comprehension and mathematical reasoning items were pretested in their 
standard multiple-choice formats, and under instructions to answer every 
item with no penalty for wrong answers, on approximately 100 students at
the same eighth-grade level as the students who subsequently participated in 
the study proper. These pretest data were used to compute indices of item 
difficulty — the percentage of correct responses — and item discrimination — 
the item-total biserial correlation coefficient. (The total score used in 
the computation of a biserial correlation coefficient was the sum of scores 
on all the items included in the pool for a given content domain.) The two
sets of items drawn from the verbal comprehension pool each contained 50
items; the two sets of items from the mathematical reasoning pool each
contained 30 items. The item sets for a content domain were matched for
pretest indices of difficulty, with the average difficulty being .50 in
each case. The item sets were also matched as closely as possible for
values of the pretest indices of discrimination. 
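
The two pretest indices described above can be computed as in the following sketch. It uses the textbook form of the biserial coefficient, (M1 - M0) / s × pq / h, where M1 and M0 are the mean total scores of examinees passing and failing the item, s is the standard deviation of the total scores, p is the proportion passing, q = 1 - p, and h is the ordinate of the standard normal density at the point dividing the distribution into p and q. The function names and the eight-examinee data set are hypothetical, and whether the authors used exactly this computing formula is an assumption.

```python
from statistics import NormalDist, mean, pstdev


def item_difficulty(item_scores):
    """Proportion of examinees answering the item correctly (0/1 scores)."""
    return mean(item_scores)


def biserial(item_scores, total_scores):
    """Item-total biserial correlation for a dichotomously scored item."""
    p = mean(item_scores)
    q = 1 - p
    if p in (0, 1):
        raise ValueError("biserial is undefined when all responses agree")
    passed = [t for s, t in zip(item_scores, total_scores) if s == 1]
    failed = [t for s, t in zip(item_scores, total_scores) if s == 0]
    # Ordinate of the standard normal density at the point that divides
    # the distribution into proportions p and q.
    std_normal = NormalDist()
    h = std_normal.pdf(std_normal.inv_cdf(p))
    return (mean(passed) - mean(failed)) / pstdev(total_scores) * (p * q / h)


# Hypothetical pretest data: one item and the pool total for 8 examinees.
item = [1, 1, 1, 1, 1, 0, 0, 0]
totals = [38, 28, 35, 30, 24, 22, 33, 26]

difficulty = item_difficulty(item)
discrimination = biserial(item, totals)
print(f"difficulty = {difficulty:.3f}, biserial = {discrimination:.3f}")
```

As in the study, the total score here is taken over all items in the pool, so the item itself contributes to the total.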

The secondary purpose of the study was to seek response format 
factors and, if such factors were isolated, to study the degree of
association between the factors and measures of possibly related 
psychological attributes. The search for format factors, given the 
design of the study, took place among the covariances between measures 
having the same format but different content. In other words, a factor 
defined by the constructed-response format was conceived as one that 
would be associated with the constructed-response measures of both the
verbal comprehension and mathematical reasoning domains of content and
not with the standard multiple-choice and Coombs measures of these domains.
Format factors associated with the standard multiple-choice and Coombs
formats would be similarly defined. 
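
The definition above implies a simple pattern of permitted associations: each of the 12 tests is associated with the factor for its content domain and with the factor for its response format, and with no other factor. The sketch below builds that pattern, with 1 marking a coefficient left free and 0 a coefficient fixed at zero. The ordering of the tests and the labels are assumptions made for illustration only.

```python
CONTENTS = ["math", "verbal"]
FORMATS = ["constructed", "standard_mc", "coombs"]
FORMS = ["A", "B"]  # the two item sets within each content domain
FACTORS = CONTENTS + FORMATS

# One test for every combination of content domain, format and item set.
tests = [(c, fmt, form) for c in CONTENTS for fmt in FORMATS for form in FORMS]


def pattern_row(content, fmt):
    """1 where the test may load on a factor, 0 where the loading is fixed."""
    return [1 if factor in (content, fmt) else 0 for factor in FACTORS]


pattern = [pattern_row(c, fmt) for c, fmt, _ in tests]

print(" " * 22, FACTORS)
for (c, fmt, form), row in zip(tests, pattern):
    print(f"{c:7s}{fmt:12s}{form}  {row}")
```

Each content factor is marked by six tests and each format factor by four tests, two from each content domain.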

The variables of following directions, recall memory, recognition 
memory and propensity for risk taking (on multiple-choice tests) were 
measured for the purpose of studying the association between these
variables and format factors, if such factors were identified. The 
ability to follow directions was measured by two instruments that had 
been used previously by Traub (1970) and prepared as adaptations of a 
test devised originally by J. W. French. Two measures of recall memory 
were employed, both of which were taken from the tests of associative
memory contained in the Kit of Reference Tests for Cognitive Factors 
(French, Ekstrom & Price, 1963). Recognition memory was assessed by
two measures, both adaptations of materials developed by Duncanson
(1966). The fourth variable, risk taking, was measured using an
instrument developed by Traub and Hambleton (1972). The rationale for this
instrument was proposed by Swineford (1938; 1941).
Design and Subjects 

Lord's procedure for testing the equivalence of measures and the
method of confirmatory factor analysis are applicable to data obtained
from a single group of subjects. In this study usable data were obtained
on 199 eighth-grade students (93 females), with a mean age on September 1,
1971 of approximately 13 years, 8 months (the standard deviation of the
age distribution was approximately 8 months). During the 1971-72 academic
year these students attended one of the two junior high schools that
cooperated in the study. Both cooperating schools were located in East
York, a borough of Metropolitan Toronto. The neighborhoods served by
these schools were described by the school principals as a mixture of
lower and middle classes.

In any study involving tests that differ only in response format, 
care must be taken to minimize memory effects. This was done in the 
present study by scheduling the tests so that there was a two-week interval 
between each administration of the same set of items and by administering 
the constructed response formats first — Heim and Watts (1966) found that 
carry-over from one administration of a set of vocabulary items to a 
second administration of the items in a different format was markedly less
when the constructed-response format preceded the multiple-choice format 
than when the reverse order was followed. 

It was anticipated, and subsequent events tended to confirm, that
motivating students to work all versions of the verbal comprehension and
mathematical reasoning tests would be a problem. The following steps were
taken to minimize this problem: (i) students were told at the first 
administration that the study was designed to find out whether people 
score better when a test involves one kind of response format than when it 
involves another, that they would be tested periodically over a period of 
weeks and that their scores on the tests would be sent to them individually 
(this promise was kept in that copies of individual report forms were 
delivered to the school when the scoring had been completed); (ii) the 
standard multiple-choice format was introduced with the comment that it 
would give the student a chance to improve his performance on the
tests; (iii) the Coombs format was introduced as another chance to
improve on past performance.

Two other critical conditions of the test administrations were the
scoring instructions and the time limits provided for the administration
of each test. On all tests employing a constructed-response format (two
verbal comprehension, two mathematical reasoning, two following directions,
and two recall memory tests) students were informed that the number of [...]
comprehension and mathematical reasoning tests with a constructed-response
format, it was to their benefit to show all their work because partial
credit could be obtained for work done on questions answered incorrectly.
Six of the remaining measures, four verbal comprehension and mathematical
reasoning tests and two tests of recognition memory, were presented in a [...]
In view of this, the students were instructed to answer every question and
guess if necessary. The four tests presented in the Coombs format
and the measure of risk taking involved rather elaborate scoring
instructions with a complex system of rewards and penalties. The students were
informed of the scoring system in each case and several examples were
considered to demonstrate the potential effect of the scoring system.

As regards the time limits for the tests, they are reported in
Table 1 for each test. These limits were established on the basis of
pilot administrations of the tests and (except in the case of the memory
tests) were generous, even for the Coombs format, which was most time
consuming, so as to achieve essentially power conditions. The time limits for
the tests of recall memory are those specified by French et al. (1963);
the limits for the recognition memory tests were set on the basis of
pretest results to achieve a satisfactory distribution of scores (i.e., a
distribution with a range approximately three standard deviations either
side of the mean).

Insert Table 1 about here

Scoring 

Special keys were prepared for the constructed-response versions 
of the verbal comprehension and mathematical reasoning tests. These keys, 
which indicated to the marker how to award partial credit for certain wrong
answers, were applied to responses obtained from a pretest and were revised
as required in the light of apparent inadequacies. The final forms of the 
keys were applied by independent scorers to a random sample of 50 
constructed-response answer booklets from one school for Form B of the tests 
for both content domains. The correlation between the marks assigned by 
the scorers was 0.97 for the verbal comprehension test and 0.98 for the
mathematical reasoning test. 

All other tests used in the study could be scored objectively. 
(Copies of tests and scoring keys are available from the authors 
on request.) 
A Note on Sample Size 

The sample size of 199 represents approximately one-half the total 
number of eighth-grade students attending the two cooperating schools 
during the time the data were being collected. The data from the other 
students were discarded for one of several reasons: (i) some students were
so-called New Canadians and had difficulty understanding written English;
(ii) some students were absent from school on one or more of the seven days
on which the tests were administered; (iii) some students attempted fewer 
than ten of the questions on a test or marked their multiple-choice answer 
sheets following a clear pattern unrelated to the pattern of correct 
answers and were judged to have paid little attention to the task; 
(iv) some students were observed to copy answers from other students during
one or more of the testing occasions. The frequency of occurrence of
reasons (iii) and (iv) was zero for the first two testing occasions but
over the next four occasions, when the test items were repeated in the 
different formats, this frequency departed quite substantially from zero. 
The occurrence of this type of behavior indicates the difficulty that is 
encountered in sustaining student motivation when tests are administered
repeatedly. 

Results and Discussion 

Basic Statistics 

Means and standard deviations for all 19 measures, coefficients α
for the 12 mathematical reasoning and verbal comprehension measures, and 
intercorrelations amongst all 19 measures are presented in Table 1. 

Alpha coefficients are not reported for the seven marker variables; 
the calculation of α is impossible for the measure of risk-taking and
cannot be justified for speeded tests such as the tests of recall and
recognition memory. Despite this, the results suggest that the
reliabilities of at least the memory tests were relatively low. The evidence for
this suggestion consists of the correlations between the pairs of tests 
designed to measure the recall and recognition memory factors. Although
these pairs of tests are not parallel in content, their intercorrelations
are much lower than would be expected for tests that reliably measure the
same ability. 
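
For reference, coefficient α for a k-item test is k / (k - 1) × (1 - Σ s_j² / s_T²), where s_j² is the variance of item j and s_T² is the variance of the total score. The sketch below applies this formula to a small hypothetical data set of five examinees and four dichotomously scored items; the function name and the data are illustrative assumptions, not material from the study.

```python
from statistics import pvariance


def coefficient_alpha(scores):
    """Coefficient alpha; `scores` holds one list of item scores per examinee."""
    k = len(scores[0])
    totals = [sum(person) for person in scores]
    # Variance of each item across examinees.
    item_variances = [
        pvariance([person[j] for person in scores]) for j in range(k)
    ]
    return k / (k - 1) * (1 - sum(item_variances) / pvariance(totals))


# Hypothetical responses: rows are examinees, columns are items.
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = coefficient_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

The formula treats every examinee as having attempted every item, which is one reason the coefficient is hard to justify for speeded tests such as the memory measures.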
Equivalence of Measures 

All possible pairs of tests having the same content and different 
formats were assessed for equivalence using Lord's (1971) procedure.
On the basis of the results achieved for the measures of mathematical 
reasoning, the hypothesis of equivalence cannot be rejected for any of the 
three possible contrasts of test formats. The results for the measures of 
verbal comprehension indicate that the hypothesis of equivalence can be 
rejected for two contrasts: constructed-response vs. standard multiple-choice
and constructed-response vs. Coombs. It was not possible to reject
the null hypothesis of equivalence for the contrast between the standard 
multiple-choice and Coombs formats. 

To ascertain whether factors associated with test format would over- 
ride those associated with content, three other pairings were considered. 
In each case response format was held constant and content was varied; 
that is, the constructed response versions of mathematical reasoning and 
verbal comprehension were tested for equivalence, as were the standard 
multiple-choice and Coombs versions of these tests. The hypothesis of
equivalence was rejected for all three of these comparisons.

The foregoing results indicate that the tests of mathematical
reasoning measured the same attribute regardless of response format, where- 
as the attributes measured by tests of verbal comprehension varied as a 
function of response format. A conception of mental functioning that would
account for this finding is the following. (1) Determining the correct
answer to a mathematical reasoning item involves working out the answer
to the item regardless of the format of the test. The work is recorded
in the case of constructed response items and is used as a basis for 
choosing a response in the case of the standard multiple-choice and 
Coombs formats. (2) Determining the correct answer to a verbal compre- 
hension item involves recalling definitions when a constructed-response 
format is used. When standard multiple-choice and Coombs formats are
used, however, it is only necessary to recognize the correct answer, and
recognition is facilitated by ruling out implausible response options.
This conception of the difference in mental operations that are employed 
in working mathematical reasoning as compared with verbal comprehension 
tests carries the implication that the main advantage of the Coombs 
response format — the possibility of revealing partial knowledge by identi- 
fying one or more response options as incorrect but not identifying all 
the incorrect options — would be utilized to a greater extent with the 
verbal comprehension than the mathematical reasoning items. Statistics
confirming this implication are reported in Table 2.
Format Factors

The intercorrelations among the 12 measures of mathematical reason- 
ing and verbal comprehension were subjected to confirmatory factor analysis 
using the COSA-I program of McDonald and Leong (Note 1) in an effort to
identify format factors. A format factor is by definition a factor 
associated with tests employing the same response format, regardless of 
test content. In line with the hypothesized existence of format factors 
is a five factor structure involving two correlated factors, one marking 
mathematical reasoning, the other marking verbal comprehension, and three
orthogonal format factors, one for each of the three response formats 
included in the study. It proved possible to obtain a satisfactory fit of 
a five-factor structure provided that, in addition to the specified five 
factors, the possibility of correlated unique factors among all six measures 
of mathematical reasoning and among all six measures of verbal comprehension 
was allowed. For this structure, the approximate statistic arising from 
the goodness-of-fit test was 21.297, which, with 11 degrees of freedom, has 
a probability of chance-occurrence under the null hypothesis of slightly more 
than 0.03. Estimated values of the parameters of this structure are given
in Table 3. 
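
The probability quoted above can be checked directly from the reported statistic and degrees of freedom. The sketch below uses the standard closed-form expression for the upper tail of the chi-square distribution with an odd number of degrees of freedom, so that only the Python standard library is needed; the function name is a hypothetical helper and is not part of the COSA-I program.

```python
import math


def chi2_upper_tail_odd_df(x, df):
    """Upper-tail probability of a chi-square variate with odd df."""
    if df < 1 or df % 2 != 1:
        raise ValueError("this closed form applies only to odd df >= 1")
    chi = math.sqrt(x)
    # Two-sided standard normal tail area beyond chi.
    normal_tail = math.erfc(chi / math.sqrt(2))
    # Standard normal density evaluated at chi.
    density = math.exp(-x / 2) / math.sqrt(2 * math.pi)
    # Sum of chi^(2r-1) / (1*3*...*(2r-1)) for r = 1 .. (df-1)/2.
    series = 0.0
    term = chi
    for r in range(1, (df - 1) // 2 + 1):
        series += term
        term *= x / (2 * r + 1)
    return normal_tail + 2 * density * series


p_value = chi2_upper_tail_odd_df(21.297, 11)
print(f"p = {p_value:.4f}")
```

The value obtained in this way should agree with the figure of slightly more than 0.03 given in the text.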



Insert Table 3 about here 

Any attempt to deviate from this structure by fitting fewer factors 
resulted in significantly larger statistics: the increase in the value of
χ² was approximately equal to twice the increase in the number of degrees of
freedom associated with each decrement of one in the number of factors from 
five to two factors. Any attempt to equate to zero some or all of the corre- 
lations among unique factors for tests having the same content resulted 
in the occurrence of a so-called Heywood case, in which the estimated unique 
variance of at least one of the 12 tests was a non-trivial but meaningless 
negative number. 

There are several points worth noting about the structure reported 
in Table 3: 

(i) The two main factors are mathematical reasoning (Factor I) 
and verbal comprehension (Factor II). As expected, these factors are highly
intercorrelated.
(ii) The sets of items comprising forms A and B of each content
domain are far from parallel, regardless of the format in which they are
presented. Had these item sets been parallel, the factor coefficients and 
unique variances of forms A and B for a given format and content domain 
would have been the same. But when this kind of constraint was imposed on 
the structure fitted to the data, the result was very unsatisfactory. 

(iii) The fitted structure ignores a substantial amount of the
variance held in common among the six tests of verbal comprehension. This 
is clear from the size of the off-diagonal entries in U for the six verbal 
tests. These entries are considerably larger on average than the corres- 
ponding entries for the six mathematical reasoning tests. 

(iv) It was hoped that factors three, four and five would look 
like format factors. In order to have the appearance of a format factor, 
the coefficients of the four tests sharing the same response format should 
all have the same algebraic sign and be large enough in absolute magnitude 
to be distinguishable from zero. The only factor of the three that comes 
close to satisfying these conditions is the third, which can, perhaps, be 
called a constructed-response factor. The coefficients on the fourth and 
fifth factors, however, do not meet the conditions for format factors. 

As a further guide to interpreting the factor structure reported 
in Table 3, an "extension analysis" (Lord, 1956, pp. 40, 42) was per- 
formed in which least squares estimates of the coefficients of the seven 
marker variables on the five factors were obtained. These coefficients 
are reported in Table 4. Several observations are supported by the
results:

(i) The tests of following directions ability have sizeable
coefficients on the mathematical reasoning and the verbal comprehension
factors (I and II, respectively). These tests do not, contrary to expecta- 
tion, have substantially larger coefficients on the fifth factor, the one 
defined by tests with the Coombs format, than they have on the third and 
fourth factors, those defined by the constructed-response and standard 
multiple-choice tests, respectively. 

(ii) The results for the tests of recall memory are interesting in 
that they have positive coefficients on the mathematical reasoning factor 
and negative coefficients on the verbal comprehension factor. The positive 
coefficients on mathematical reasoning may reflect nothing more than that 
the two recall memory tests required examinees to form associations between 
pictures or object labels and numbers. It is possible, however, to use 
these results as partial support for the previously described theory of 
examinee behavior on constructed-response as compared with multiple-choice 
tests. According to the theory, examinees respond to mathematical reasoning 
items, regardless of test format, by doing the operations needed to derive 
answers to the questions. This is an activity which presumably would draw 
heavily on recall memory. The factor structure provides support for this 
suggestion. The theory also predicts (a) a positive association between 
recall memory and constructed-response tests of verbal comprehension and 
(b) a positive association between recognition memory and multiple-choice 
tests of verbal comprehension. Because the verbal comprehension factor in 
this study is marked by both constructed response and multiple-choice tests, 
it is difficult to predict just what associations there should be between 
the verbal comprehension factor and the tests of recall memory and 
recognition memory. The obtained negative coefficients for recall memory 
on the verbal comprehension factor are something of a puzzle (why should
performance on these tests be hampered by recall memory?), but the positive
coefficients for recognition memory on this factor are not surprising, 
although their size is smaller than might be expected. 

(iii) Neither the recall memory nor the recognition memory tests had 
coefficients the size they were expected to have on the factors marked by 
tests with different formats, i.e., high coefficients for recall memory on 
the third factor and high coefficients for recognition memory on the fourth 
and fifth factors. 

(iv) The positive coefficient for the measure of risk taking on 
the verbal comprehension factor is most probably a reflection of the fact 
that the content of the risk-taking measure consisted of vocabulary items. 
The negative coefficients for this measure on the first, fourth and fifth 
factors are not so large as to suggest an important negative association 
between risk-taking behavior and the abilities defined by these factors. 

Conclusions 

The main conclusion of this study concerns the equivalence of 
measurements arising from tests based on the same content but employing 
different formats. When content was held constant and allowance was made 
for differences due to errors of measurement and scale parameters, i.e., 
units and origins, the tests of mathematical reasoning that were employed 
were equivalent regardless of format, but the tests of verbal comprehension 
were not. In particular, the free-response tests of verbal comprehension 
seemed to measure something different from standard multiple-choice and 
Coombs tests of this ability, although the standard multiple-choice and 
Coombs formats themselves yielded equivalent measures of verbal comprehension. 
This finding, if it proves to be generally true, has obvious methodological 
implications for educational and psychological researchers. The 
design of instruments to measure verbal comprehension must be done with the 
full awareness that different formats may well yield measures of different 
abilities. This same concern is apparently not necessary for tests of 
mathematical reasoning. 
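
The phrase "allowance was made for differences due to errors of measurement" 
can be illustrated with the classical correction for attenuation. The sketch 
below is not Lord's (1971) test, which is a formal significance test; it only 
shows the descriptive idea that two tests measuring the same attribute should 
correlate near 1 once unreliability is removed. The function name and the 
numerical values are hypothetical and are not taken from the study. 

```python
import math


def disattenuated_correlation(r_xy, r_xx, r_yy):
    """Correct an observed correlation for unreliability in both tests.

    r_xy: observed correlation between test X and test Y
    r_xx, r_yy: reliability coefficients of X and Y
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


# Hypothetical values, chosen only for illustration.
estimate = disattenuated_correlation(r_xy=0.60, r_xx=0.64, r_yy=0.5625)
print(round(estimate, 3))
```

An estimate close to 1 is consistent with equivalence in the sense used here; 
a value well below 1 suggests that the two formats tap different attributes. 
Because the correlation is unaffected by linear rescaling, differences in 
units and origins do not enter the calculation. 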

The foregoing, major conclusion of the study cannot go unqualified. 
In this study, all the pairs of tests with the same format, regardless of 
whether the content consisted of mathematics questions or vocabulary items, 
were not statistically parallel (i.e., they had different means, variances, 
reliability coefficients, and intercorrelations with other variables). 
Further evidence of the lack of parallelism was obtained from the factor 
analyses that were attempted, in that a parallel forms factor structure did 
not provide a satisfactory fit to the matrix of intercorrelations among the 
12 mathematical reasoning and verbal comprehension tests. The results of 
the only factor analysis that gave satisfactory results indicate that the 
unique factor, including error of measurement, for one form of a test was 
correlated with the unique factor for the "parallel" form of the test. 
The statistical test of equivalence provided by Lord assumes the existence 
of "replicate" measurements — truly parallel tests would provide replicate 
measurements — having errors of measurement that are uncorrelated across 
replications (Lord, 1971, p. 2). Nothing seems to be known about the 
robustness of Lord's test when this assumption is violated. 




The second, and very much weaker, conclusion of this study is that 
evidence was obtained of the existence of a constructed-response format 
factor. The evidence for this factor is weak because, although the 
constructed-response tests of mathematical reasoning and verbal comprehension 
had, as expected, positive coefficients on this factor, all the coefficients 
were small in absolute magnitude and the factor did not have the expected 
associations with the marker variables. 

The primary reason for undertaking the study, to identify format 
factors and gain an understanding or explanation of these factors by relating 
them to marker variables for following-directions ability, recall and 
recognition memory, and risk taking, appears to have been unjustified. It 
was not possible to identify format factors that were clearly marked and 
that accounted for a substantial amount of variance common to the tests 
having the same format regardless of content. 

Footnotes 

1 

The authors are indebted to a large number of people who helped in 
one way or another to make the study possible: Mary Cockell, Liz Falk, 
Colin Fraser, Mohindra Gill, Lorne Gundlack, Gladys Kachkowski, Kuo Leong, 
Joyce Townsend, Pat Tracy, Wanda Wahlstrom, Da™ Whitmore, 

2 

This project was supported by a research grant from the Office of 
the Coordinator of Research and Development, OISE. 
3 

The ability to follow directions has been called integration and 
defined as the "ability simultaneously to bear in mind and to combine or 
integrate a number of premises, or rules, in order to produce the correct 
response" (Lucas & French, 1953, p. 3). 
4 

All ETS items were used with the full knowledge and permission of 
the publisher. 

5 

These materials were used with permission. 
6 

We are grateful to Mr. Gordon Brown, Principal, St. Clair Junior 
School, and Mr. Frank Gould, Principal, Westwood Junior School, their teach- 
ing staffs and students for cooperating in the study. 

7 

The strategy of testing all possible pairs of instruments for 
equivalence can be criticized because the tests that are made are not 
linearly independent. In this specific instance, however, a better 
strategy, one that would avoid this criticism, did not suggest itself. 

8 

Interested readers can obtain the statistics that were computed 
at each stage of each application of Lord's test of equivalence by writing 
to the senior author. 
9 

The frequency of partial knowledge responses might reasonably be 
expected to increase as test difficulty increased. Differences in test 
difficulty do not, on average, appear to account for the present results. 
The mean scores on the multiple-choice versions of the mathematical reasoning 
and verbal comprehension tests are approximately equal to one-half the 
total number of items in the tests. 

Table 2 

Means and Standard Deviations of the Distributions of the Frequencies with 
which Students Employed each Type of Response in the Coombs Format 
(standard deviations appear in brackets) 

                                     Associated Mathematical Reasoning    Verbal Comprehension
Response                               Score    Form A       Form B       Form A       Form B

a) 4 wrong options                         4    13.5 (6.4)   14.5 (6.8)   23.6 (10.8)  21.0 (10.3)
b) 3 wrong options                         3      .5 (1.0)     .6 (1.2)    3.7 (4.6)    4.2 (4.7)
c) 2 wrong options                         2      .4 (1.0)     .6 (1.6)    1.5 (2.5)    1.9 (3.1)
d) 1 wrong option                          1      .5 (1.9)     .3 (1.0)    1.1 (2.7)    1.0 (3.0)
e) 0 wrong options                         0     1.0 (2.2)     .7 (1.4)    1.0 (2.6)    1.0 (2.8)
f) 3 wrong options plus correct answer    -1    12.9 (6.0)   12.1 (6.6)   15.7 (8.7)   16.6 (10.0)
g) 2 wrong options plus correct answer    -2      .7 (1.6)     .7 (1.6)    2.2 (3.0)    3.1 (4.2)
h) 1 wrong option plus correct answer     -3      .3 (.8)      .4 (1.1)     .7 (1.5)     .9 (1.6)
i) 0 wrong options plus correct answer    -4      .1 (.4)      .1 (.4)      .2 (.6)      .2 (.6)
j) All options marked                     -5      .3 (.9)      .1 (.5)      .2 (.6)      .2 (.5)

Percentage of partial knowledge responses(a)     8.3          9.0         18.8         22.5

(a) Computed from the formula: (Sum of means for responses b, c, d, g, h, i) / 
(Sum of means for all responses), expressed as a percentage. 
The sum of means for all responses differs from 30 or 50 because of rounding error. 
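
The footnote formula for the percentage of partial knowledge responses can be 
checked against the tabled means. The sketch below is an illustration only: 
the means are transcribed from Table 2 (the column labelled Form B is a 
reading of a partly illegible heading), and the names MEANS, PARTIAL and 
partial_knowledge_percentages are hypothetical. Because the tabled means are 
rounded, the recomputed percentages should be expected to agree with the 
published row only to within about 0.1. 

```python
# Mean frequencies transcribed from Table 2, keyed by response type a-j.
# Column order: mathematical reasoning Form A, Form B,
#               verbal comprehension Form A, Form B.
MEANS = {
    "a": (13.5, 14.5, 23.6, 21.0),
    "b": (0.5, 0.6, 3.7, 4.2),
    "c": (0.4, 0.6, 1.5, 1.9),
    "d": (0.5, 0.3, 1.1, 1.0),
    "e": (1.0, 0.7, 1.0, 1.0),
    "f": (12.9, 12.1, 15.7, 16.6),
    "g": (0.7, 0.7, 2.2, 3.1),
    "h": (0.3, 0.4, 0.7, 0.9),
    "i": (0.1, 0.1, 0.2, 0.2),
    "j": (0.3, 0.1, 0.2, 0.2),
}

# Response types counted as partial knowledge in the table footnote.
PARTIAL = ("b", "c", "d", "g", "h", "i")


def partial_knowledge_percentages(means=MEANS, partial=PARTIAL):
    """Return one percentage per column: 100 * partial sum / total sum."""
    n_cols = len(next(iter(means.values())))
    percentages = []
    for col in range(n_cols):
        total = sum(row[col] for row in means.values())
        part = sum(means[key][col] for key in partial)
        percentages.append(100.0 * part / total)
    return percentages


for pct in partial_knowledge_percentages():
    print(f"{pct:.2f}")
```

The column totals used as denominators are the sums of the rounded means, 
which is why they come out near, rather than exactly at, 30 and 50. 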